Multivariate continuous data

In this lecture, we’ll focus on visualising large numbers of continuous variables.

Scatterplots vs boxplots

data(iris)
boxplot(iris[,4:1])

Boxplots are effective for getting an overview of major differences between variables - particularly in terms of their location (vertical position) and scale (the length of the box). However, boxplots are too simple to show whether a variable splits into different modes or groups. Boxplots are also fundamentally a 1-dimensional graphic - they cannot detect or display whether the variables are related.
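To see why boxplots can hide grouping, here is a small simulated example (our own illustration, not part of the iris data): a unimodal and a bimodal variable with similar medians and spreads produce near-identical boxplots, while a density plot reveals the two modes.

```r
## Simulated illustration: a unimodal and a clearly bimodal variable
## whose boxplots look very similar.
set.seed(1)
unimodal <- rnorm(200, mean = 0, sd = 3)
bimodal  <- c(rnorm(100, -2, 0.5), rnorm(100, 2, 0.5))
par(mfrow = c(1, 2))
boxplot(list(unimodal = unimodal, bimodal = bimodal),
        main = "Boxplots: similar location and spread")
plot(density(bimodal), main = "Density: two modes")
```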

pairs(iris[,4:1], pch=16)

Conversely, the scatterplot matrix exposes a great deal of the structure of the data, and makes relationships clear. Patterns, trends, and groups emerge naturally and can be spotted by eye. But it is not as effective for comparing scales and locations - we still need the boxplots, even if they are limited!

Bubbleplots

What if we need to include a third numerical variable in a scatterplot? One way to achieve this is a bubble plot, where we use the value of a third variable to control the size of the points we draw on a scatterplot:

library(plotly)
plot_ly(iris, x=~Petal.Width, y=~Petal.Length, size=~Sepal.Width, color=~Species,
    sizes = c(1,50), type='scatter', mode='markers', marker = list(opacity = 0.7, sizemode = "diameter"))

Here we have positioned the points of the iris data according to the Petal measurements and used the size of the point to indicate the Sepal.Width. We can note that the orange points (versicolor) appear to have more small bubbles than the others, and the green points (setosa) look to have larger values.

This technique can be quite effective when the variable corresponding to point size corresponds to some measure of importance, scale, or size (such as population per country, number of samples in a survey, number of patients in a medical trial). While it can be effectively used in three-variable problems, for more variables than this we require other methods.
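If plotly is not available, a similar bubble plot can be sketched in base R by mapping the third variable to the point size via cex (the rescaling to the range 0.5-3 below is an arbitrary choice for legibility, not a plotly default):

```r
## Base-R bubble plot: point size (cex) scaled by Sepal.Width.
## The rescaling to 0.5..3 is an arbitrary choice for legibility.
data(iris)
sz <- with(iris, 0.5 + 2.5 * (Sepal.Width - min(Sepal.Width)) /
                   diff(range(Sepal.Width)))
plot(Petal.Length ~ Petal.Width, data = iris, cex = sz, pch = 16,
     col = adjustcolor(as.numeric(iris$Species) + 1, alpha.f = 0.5))
legend("topleft", legend = levels(iris$Species), pch = 16, col = 2:4)
```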

Dealing with more continuous variables requires scaling up many of the familiar techniques from earlier. Individual numerical summaries can still be computed, though comparing them across many variables becomes difficult. Some of the standard visualisations, such as histograms and box plots, extend easily to many variables.

One particularly useful technique is to take scatterplots, and draw many of them in a matrix to effectively compare multiple variables at once. Scatterplot matrices are a matrix of scatterplots with each variable plotted against all of the others.

Like a grid or trellis of scatterplots, we produce an array of plots, but now we plot all variables simultaneously!

These give excellent initial overviews of the relationship between continuous variables in data sets with relatively small numbers of variables.
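One common refinement, sketched here using the panel mechanism documented in ?pairs, is to replace the redundant upper triangle with correlation coefficients:

```r
## Scatterplots below the diagonal, correlation coefficients above.
panel.cor <- function(x, y, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))        # use a fixed coordinate system for the text
  text(0.5, 0.5, sprintf("r = %.2f", cor(x, y)), cex = 1.2)
}
pairs(iris[, 1:4], upper.panel = panel.cor, pch = 16)
```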

Example: Swiss Banknotes

Let’s use a data set on Swiss banknotes to illustrate variations we can make to a scatterplot matrix.

The data are six measurements on each of 100 genuine and 100 forged Swiss banknotes. The idea is to see if we can distinguish the real notes from the fakes on the basis of the notes’ dimensions alone:

library(car)
data(bank,package="gclus")
scatterplotMatrix(bank[,-1], smooth=FALSE, regLine=FALSE, diagonal=TRUE, groups=bank$Status,
                  col=c(5,7), pch=c(16,16))

  • Blue=Genuine, Yellow=Fake.
  • We can use the diagonal elements to show histograms (or smoothed histograms)
  • Can we see any variables for which the two groups are well separated? “Diagonal” looks like a good candidate. Looking at the (smoothed) histograms in the Diagonal panel, we can see the two distributions barely overlap indicating a useful variable for separating the groups.
  • We can see some associations and relationships here: “Left”/“Right” are positively correlated, but “Top”/“Diagonal” are negatively associated.
  • There also seem to be a few possible outliers, not all are forgeries!

Challenge

  1. Type data(airquality) to load the built-in airquality data set
  2. Extract from the data set only the observations of the first four variables (i.e. exclude month and day)
  3. Use the scatterplotMatrix() function in the package car to produce a scatterplot matrix of the four variables in the data set
  4. Can you play around with the options to personalise the plot? For example, how do you include histograms of the variables in the diagonal?
Click for solution

The plot with default options looks like this:

data(airquality)
scatterplotMatrix(airquality[,-c(5,6)])

We can make this nicer by excluding some of the features and formatting the dots:

scatterplotMatrix(airquality[,-c(5,6)], smooth=FALSE, regLine=FALSE, pch=16)

If we want histograms on the diagonal, we use the diagonal option.

scatterplotMatrix(airquality[,-c(5,6)], smooth=FALSE, regLine=FALSE, pch=16, diagonal=list(method ="histogram"))

You have lots and lots of options for what to put on the diagonal: See the help by typing ?scatterplotMatrix for an extensive list (you need to scroll down a little bit!)

Parallel Coordinate Plots

Parallel coordinate plots (PCP) have become a popular tool for highly multivariate data.

Rather than using perpendicular axes for pairs of variables and points for data, we draw all variables on parallel vertical axes and connect data with lines.

The variables are standardised, so that a common vertical axis makes sense.

par(mfrow=c(1,2))
pairs(iris[,4:1],pch=16,col=c('#f8766d','#00ba38','#619cff')[iris$Species])

library(lattice)
parallelplot(iris[,1:4], col=c('#f8766d','#00ba38','#619cff')[iris$Species], horizontal=FALSE)
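To see what parallelplot is doing under the hood, here is a hand-rolled sketch in base R: rescale every column to [0, 1] and join each observation’s values with matplot(). (The min-max rescaling is our assumption about the standardisation; it matches the common convention for PCPs.)

```r
## Manual parallel coordinate plot: rescale each variable to [0, 1],
## then draw one line per observation with matplot().
rng  <- apply(iris[, 1:4], 2, range)
resc <- scale(iris[, 1:4], center = rng[1, ], scale = rng[2, ] - rng[1, ])
matplot(t(resc), type = "l", lty = 1, xaxt = "n", ylab = "scaled value",
        col = c('#f8766d', '#00ba38', '#619cff')[iris$Species])
axis(1, at = 1:4, labels = colnames(resc), cex.axis = 0.8)
```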

In general:

  • Each line represents one data point.

  • The height of each line above the variable label indicates the (relative) size of the value for that data point. The variables are transformed to a common scale to make comparison possible.

  • All the values for one observation are connected across the plot, so we can see how things change for individuals as we move from one variable to another. Note that this is sensitive to the ordering we choose for the variables in the plot – some orderings of the variables may give clearer pictures than others.

For these data:

  • We can see how the setosa species (red) is separated from the others on the Petal measurements in the PCP by the separation between the group of red lines and the rest.

  • Setosa (red) is also generally smaller than the others, except on Sepal Width where it is larger than the other species.

  • We can also pick out outliers, for instance one setosa iris has a particularly small value of Sepal Width compared to all the others.

Using parallel coordinate plots:

  • Reading and interpreting a parallel coordinate plot (PCP) is a bit more difficult than a scatterplot, but can be just as effective for bigger problems.

  • When we identify interesting features, we can then investigate them using more familiar graphics.

  • A PCP gives a quick overview of the univariate distribution for each variable, and so we can identify skewness, outliers, gaps and concentrations.

  • However, for pairwise properties and associations it’s best to draw a scatterplot.

Example: Guardian university league table 2013

The Guardian newspaper in the UK publishes a ranking of British universities each year; it reported these data in May 2012 as a guide for 2013. 120 universities are ranked on 8 criteria, which are combined into an ‘Average Teaching Score’ used to form the ranking.

library(GDAdata)
data(uniranks)
## a little bit of data tidying
names(uniranks)[c(5,6,8,10,11,13)] <- c('AvTeach','NSSTeach','SpendPerSt','Careers',
                                        'VAddScore','NSSFeedb')
uniranks1 <- within(uniranks, StaffStu <- 1/(StudentStaffRatio))
## draw the scatterplot matrix
pairs(uniranks1[,c(5:8,10:14)])

We can see some obvious patterns and dependencies here. What about the parallel coordinate plot?

parallelplot(uniranks1[,c(5:8,10:14)], col='black',  horizontal=FALSE)

Can we learn anything from this crazy mess of spaghetti?

PCPs are most effective if we colour the lines to represent subgroups of the data. We can colour the Russell Group universities, which unsurprisingly are usually at the top (except on NSSFeedback!)

## create a new variable to represent the group we want
uniranks2 <- within(uniranks1,
                    Rus <- factor(ifelse(UniGroup=="Russell", "Russell", "not")))

parallelplot(uniranks2[,c(5:8,10:14)], col=c("#2297E6","#DF536B")[uniranks2$Rus],  horizontal=FALSE)

Using colour helps us to differentiate the lines for the different data points more clearly. Now we can start to explore for features and patterns:

  • AvTeach is the overall ranking, so notice how high/low values here are connected to high/low values elsewhere.
  • There are some exceptions - the 3rd ranked university on AvTeach has a surprisingly low NSSTeaching score (note the rapidly descending red line)
  • Notice how the data ‘pinch’ at certain values on NSSTeach and NSSOverall, leaving gaps - this suggests a granularity effect due to the limited set of possible values on the NSS survey scores, e.g. integer scores out of 10. This creates heaping and gaps in the data values, as the data are really ordinal.
  • Also, notice how most of the lines move upwards when connecting NSSTeach and NSSOverall - this suggests a positive correlation.
  • The extremes of these two NSS scores are rarely observed, and they also correspond strongly to extremes on the overall score and ranking.
  • The heavy concentration of low values on EntryTariff (entry requirements) and StaffStu (staff:student ratio) suggests strong skewness in their distributions.
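The heaping effect mentioned above is easy to reproduce with simulated data (the scores below are made up, not taken from the league table): rounding continuous scores onto a coarse grid collapses them onto a few repeated values, which is exactly what creates the pinch points on a PCP axis.

```r
## Simulated survey-style scores: rounding to whole numbers creates
## heaping (repeated values) with gaps between them.
set.seed(42)
raw     <- rnorm(200, mean = 80, sd = 4)  # underlying continuous score
rounded <- round(raw)                     # what the survey actually reports
length(unique(raw))      # 200 distinct values
length(unique(rounded))  # far fewer: the data pile up on a coarse grid
```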

A scatterplot matrix corroborates most of these features, though we’re probably near the limit of what we can read and extract from such a plot!

pairs(uniranks2[,c(5:8,10:14)],col=c("#2297E6","#DF536B")[uniranks2$Rus], pch=16)

How to explore a complex data set (Boston Data)

The Boston housing data contains 1 binary, 1 discrete, and 12 numeric variables. Looking first at the categorical variables:

data(Boston, package='MASS')

par(mfrow=c(1,2))
for(i in c("chas","rad")){
  barplot(table(Boston[,i]), main=paste("Barchart of",i),col='#72abe3')
}

We note the following features, which should probably be explored further:

Turning now to histograms of the continuous variables:

## select only the continuous variables
cts <- !(names(Boston) %in% c("chas","rad"))
par(mfrow=c(3,4))
for(i in names(Boston)[cts]){
  hist(Boston[,i], col='#72abe3', xlab='', ylab='', main=paste("Histogram of",i))
}

Here we identify the following features worthy of a closer look:

Good practices

  • Looking at individual variables only gives information on each variable in isolation.
  • Identifying dependencies and relationships requires a multivariate overview.
  • A scatterplot matrix is an effective place to start, allowing us to quickly identify any obvious relationships between variables, but it struggles when there are many variables.
  • Parallel coordinate plots can also be used, but are most useful when comparing subgroups across many variables.

Looking first at a scatterplot matrix of the continuous variables:

library(scales)
pairs(Boston[,cts],pch=16,col=alpha('black',0.2))

  • There is a curious “either-or” relationship between crim and zn
  • There are some strong dependencies among the variables nox, rm, age, and dis in the centre of the plot
  • There are clear signs of outliers on several variables
  • The spikes and gaps are clearly visible on tax, ptratio, black and indus
  • If we’re still interested in medv as a response variable, then there are further dependencies with other variables, as shown in the bottom row/rightmost column.
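When the matrix flags a panel worth a closer look, a natural next step is to redraw just that pair as a full-size scatterplot, for example medv against lstat (the lowess smoother here is our addition, not part of the lecture code):

```r
## Zoom in on one panel of the matrix: medv vs lstat, with a lowess curve.
data(Boston, package = 'MASS')
plot(medv ~ lstat, data = Boston, pch = 16, col = adjustcolor('black', 0.3),
     xlab = 'lstat (% lower status)', ylab = 'medv (median value, $1000s)')
lines(lowess(Boston$lstat, Boston$medv), col = 'red', lwd = 2)
```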

Correlation and correlation plot (corrplot)

  • The scatterplot matrix is easily overloaded with many variables and cases.
  • Correlations provide a useful numerical summary, and the corrplot is a good way to visualise that information.
  • Note that the standard correlation coefficient is not a good measure of non-linear relationships or suitable for categorical variables! Alternative correlations exist for such variables, but these are not computed by default.
round(cor(Boston),2) ## round to 2dp
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
library(corrplot)
corrplot(cor(Boston))

There are clearly strong associations between age and dis, indus and dis, and lstat and medv that may be worth studying closer. chas appears almost uncorrelated to all the other variables – but remember chas was a binary variable, and so the correlation coefficient is meaningless here and we should not draw conclusions from this feature!
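To illustrate the earlier caveat about non-linear relationships, we can compare Pearson with the rank-based Spearman coefficient (both available through base R’s cor()) on a deliberately non-linear but perfectly monotone pair:

```r
## A perfectly monotone, non-linear relationship:
x <- seq(0, 5, length.out = 100)
y <- exp(x)
cor(x, y)                       # Pearson: well below 1
cor(x, y, method = "spearman")  # Spearman: exactly 1 (ranks agree perfectly)
```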

Heatmap

  • A heatmap represents the entire data set in a grid of coloured boxes.
  • Each column is a variable, and each row is a data point
  • Each cell is then a data value, coloured by its relative value to other values in the column.
  • Again, variables are rescaled so that each column is shaded from its minimum observed value to its maximum.
  • The dendrograms above/left suggest groupings of similar cases and variables.
library(gplots)
heatmap.2(as.matrix(Boston), scale='column',trace='none')

Here we can see:

  • The lower quarter of the heat map shows a lot of pale orange/yellow columns - the data points here typically take low values, and do so on multiple variables.
  • The second column from the right appears to be roughly constant for most of the data, apart from a cluster of cases with very high (red) values.

Challenge

  1. Type data <- iris[, 1:4] to load the numerical values in the iris data set
  2. Normalise the data set by typing data <- scale(data)
  3. Produce a heatmap using the base R heatmap() function (no need to install any package). What do the colours represent?
  4. What are the small numbers to the right of the plot? What about those strange trees on top and to the left?
Click for solution
# Load and preprocess data
data <- iris[, 1:4]  # Select only numeric columns
data <- scale(data)   # Normalize the data

# Adjust margins
par(mar = c(20, 8, 2, 2))

# Create heatmap. Options needed to visualise the full names of the variables.
heatmap(data, cexCol = 0.9, labCol = colnames(data))

Colours represent departure from the mean of the variable values: with the default heat.colors palette, red = small values and pale yellow = large values. The numbers on the right are the observation (row) numbers, ordered so that similar observations are grouped together. How that grouping is done is shown by the dendrogram on the left; the dendrogram on the top is similar, but shows similarities between variables.
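The grouping behind those dendrograms is ordinary hierarchical clustering; here is a sketch of what heatmap() does internally (it calls dist() and hclust() by default) on the scaled iris data:

```r
## What heatmap() does behind the scenes (by default): cluster rows and
## columns with hclust() on Euclidean distances of the scaled data.
data   <- scale(iris[, 1:4])       # centre and standardise each column
row_cl <- hclust(dist(data))       # similarity between observations (rows)
col_cl <- hclust(dist(t(data)))    # similarity between variables (columns)
plot(col_cl, main = "Variable dendrogram (top of the heatmap)")
```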

Here is a simpler version of the heatmap (for advanced graphics sometimes you need extra options just to get rid of the unwanted information!)

heatmap(data, 
        Colv = NA, Rowv = NA,  # Remove dendrograms
        col = heat.colors(256), # Use a nice color palette
        main = "Simple Heatmap of Iris Data",
        cexCol = 0.9, # Increase column label size
        labCol = colnames(data))